Single cell characterization of myeloma and its precursor conditions reveals transcriptional signatures of early tumorigenesis

Multiple myeloma is a plasma cell malignancy almost always preceded by precursor conditions, but low tumor burden of these early stages has hindered the study of their molecular programs through bulk sequencing technologies. Here, we generate and analyze single cell RNA-sequencing of plasma cells from 26 patients at varying disease stages and 9 healthy donors. In silico dissection and comparison of normal and transformed plasma cells from the same bone marrow biopsy enables discovery of patient-specific transcriptional changes. Using Non-Negative Matrix Factorization, we discover 15 gene expression signatures which represent transcriptional modules relevant to myeloma biology, and identify a signature that is uniformly lost in abnormal cells across disease stages. Finally, we demonstrate that tumors contain heterogeneous subpopulations expressing distinct transcriptional patterns. Our findings characterize transcriptomic alterations present at the earliest stages of myeloma, providing insight into the molecular underpinnings of disease initiation.

dx to 1st FISH results (time of 1st iFISH assay, in days from diagnosis), time from dx to 2nd FISH results (time of 2nd iFISH assay, in days from diagnosis), M Protein when sample was taken (g/dL), BMPC % (% plasma cells in bone marrow biopsy), serum free light chain ratio (involved/uninvolved), 20/2/20 risk for SMM patients 2 , progression_to_mm (1=Has progressed to MM, 2=Has not progressed to MM, 3=MM was the original diagnosis), days till mm diagnosis (from the time of initial diagnosis), treated during MGUS/SMM (0=no, 1=yes), and follow up time per patient (days). Supplementary Data 3. Sample information. Quality metrics for scRNAseq samples, including whether the sample was fresh or frozen, batch ID, n cells retained after removing low quality cells and non-CD138+ cells (QC), and median UMI post-QC.

Supplementary Data 4. Cell level information.
Sample of origin, Leiden clustering assignment, normal/abnormal label, n genes detected, fraction mitochondrial reads, and n UMI detected per cell, for cells retained post-QC. Supplementary Data 5. DEGs between CD20+ and CD20-subclones in SMM-12. List of differential expression testing results for cells in SMM-12's CD20+ vs. CD20-subclones. A two-sided Wilcoxon rank sum test with Benjamini-Hochberg correction was used, and fold changes were calculated as described in Methods ("Within-patient differential expression testing"). Supplementary Data 6. List of 764 differentially expressed genes (|log(fold change)| > log(1.5); q<0.1) discovered for abnormal vs.
List of differentially expressed genes (two-sided Wilcoxon rank sum test with Bejamini-Hochberg correction, |log(fold change)| > log(1.5); q<0.1) discovered per-sample using our within-patient differential expression analysis, along with their fold changes and significance levels. Genes will were fresh or frozen prior to sequencing), sex, and age. b, The cluster assignments for cells from each sample, separately for normal cells (top) and abnormal cells (bottom), with normal/abnormal status determined using the per-sample clustering technique described in Methods. Arrows point to normal cells from patients that cluster together with normal cells from NBM, rather than together with the other cells from the same patient donor. This pattern suggests that the disease signal's influence on clustering is stronger than that of a potential batch effect. a, UMAP embedding of cells from sample SMM-12, which has 190 cells belonging to a CD20+ subclone (cluster on right) out of 284 total cells. Both clusters express CCND1, signifying that both harbor a t(11;14) translocation, but the cluster on the right expresses higher levels of CD20 (also known as MS4A1), CD27, CD74. b, Volcano plot of DEGs between the CD20+ and CD20subclones from sample SMM-12. Orange denotes a significant DEG (|log(fold change)| > log(1.5); q<0.1). The top genes, ranked by q-value, are annotated on the plot. c, UMAP embedding of cells from sample SMM-12, colored by expression of immunoglobulin constant region genes. Both subpopulations express a kappa light chain (encoded by the gene IGKC) and IgG heavy chain (encoded by genes named IGHG*).

Supplementary Figure 4
Expression of marker genes on cells from sample SMM-8 reveals separation of abnormal and normal plasma cells a b

Supplementary Figure 4. Example of clustering approach for labeling normal and abnormal plasma cells within each sample
a, UMAP plot showing clustering for representative sample SMM-8. Genes in immunoglobulin loci were not used in the computation of the UMAP embedding or Leiden clustering. b, Expression of genes used to determine that cluster 0 contains abnormal cells and the cluster 1 contains normal cells: t(4;14) translocation associated genes WHSC1 and FGFR3 (first row); genes encoding the IgA-kappa immunoglobulin expressed on this patient's monoclonal cells (second row); genes encoding non-clonal immunoglobulin components IgG (IGHG1) and the lambda light chain (IGLC2) (third row). We observe that cells in cluster 0 clonally express IgA-kappa genes, while cluster 1 contains a mixture of cells expressing IgA and IgG heavy chains and kappa and lambda light chains. We performed a similar analysis for every patient sample in our cohort. Each cell in the sample expresses either a kappa (IgK) or lambda immunoglobulin light chain.

Data Likelihood:
In actuality, we observe a mixture of cells, and we do not know which are normal vs. abnormal. We do observe what immunoglobulin light chain each cell expresses.
The likelihood of observing n κ kappa-expressing cells in a sample of N cells is modeled as P(n κ ) ~ Binomial(N, p). In our Bayesian purity model, a sample is generated based on three prior distributions: The relative proportion of normal vs. abnormal cells in a sample (i.e. sample purity ρ) is drawn from a uniform prior.
ρ ∼ Beta(1,1) For the abnormal cell population, the proportion of cells expressing IgK is either 0 or 1 (due to clonality), and is drawn is drawn from a Bernoulli distribution with a 50% probability.
κ t ~ Bernoulli(0.5) For the normal plasma cell population, the fraction of cells expressing IgK is drawn from a normal distribution truncated between 0 and 1. μ and σ 2 are determined based on the NBM samples in our cohort.
κ n ~ Truncated Normal(μ,σ 2 ,0,1) Our model uses the likelihood of the observed data and these priors to infer a posterior probability over the purity of each sample. ...for example, if a sample has purity ρ =1, it will either have 0% or 100% IgK cells, depending on the value of κ t

Supplementary Figure 6. Additional visualizations of transcriptional differences detected between normal and abnormal cells a,
MSigDB hallmark genesets differentially enriched in abnormal samples (N=23 independent samples from 22 individuals) compared to normal (N=23 independent samples) (two-sided t-test, q<0.1). The difference between the mean enrichment among abnormal vs. normal samples is plotted on the x-axis. Source data are provided as a Source Data File. b, This heatmap portrays how differentially expressed genes (DEGs) were shared or varied across disease stages. The genes shown are the same genes annotated on the volcano plot in Fig. 2g; these genes were uniquely discovered to be differentially expressed using our within-patient DE analysis, and not limma-voom. The size of the circle represents the fraction of samples from each disease stage in which that gene was determined to be differentially expressed (|log(fold change)| > log(1.5); q<0.1), out of a total of two samples that had DEGs from MGUS, five from SMM, and three from MM. The color of the dot represents the direction and magnitude of the greatest significant |log 2 fold change| in each disease stage. For the purposes of the visualization, we cap the color bar at 2.5 and -2.5. a, Inter-patient variation exists even among normal plasma cells (even among those from NBM donors), as observed in these UMAP embeddings containing only cells labeled as normal. The normal plasma cells on these plots are colored by the disease stage (left) and sample ID (right) of the patient from which they were extracted. b, As further evidence of inter-patient variation among NBM samples, we found that comparing one NBM sample against the others returns many DEGs. As an example, comparing cells from patient NBM-1 to those from other NBM patients returned 1,061 DEGs (|log(fold change)| > log(1.5); q<0.1). Here, we visualize the top 30 DEGs (by q-value) for NBM-1 vs. other NBMs. Each column represents the gene expression of a single cell, and cells are ordered by the samples from which they originated. Expression is z-scored row-wise. We see that expression of the top genes from the signature are generally also downregulated in abnormal cells as compared to normal plasma cells at all stages of disease. For CD27, the one notable exception is sample SMM-12, which has a CD20+ phenotype. c, In bulk samples from the MMRF dataset (N=826 independent samples), we observe a significant negative correlation between sample purity and 'normal plasma cell signature' score (Spearman R=-0.45, p=4.58×10 -42 ). The joint density of purity and signature score is plotted, with the intensity of color indicating the number of samples at a given part of the distribution (see color bar). Boxplots representing the marginal distributions of signature scores and purities are plotted along the top and right, respectively (center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers).

Supplementary Figure 9. No contamination from B cells, T cells, or monocytes in CD138+ cells
a, B cell surface marker CD20 (also known as MS4A1) and B cell transcription factors E2F1, PAX5, and BCL6 are virtually not expressed in normal or abnormal CD138+ cells (with the exception of CD20 expression in SMM-12; see (b)), while they are expressed on the B cells which were removed from our data as part of QC. The plasma cell transcription factor PRDM1 is expressed on CD138+ cells, but not B cells. Violin plots show distribution of expression over cells in each group. b, Expression of CD20 on CD138+ cells is largely driven by patient SMM-12, who has a CD20+ MM phenotype. c, T cell and monocyte marker genes are hardly expressed in our CD138+ normal or abnormal cells, while they are expressed in the T cell and monocyte populations which we removed from our data as part of QC. While there are low levels of monocyte marker genes among normal PCs, indicating possible low levels of ambient contamination in some normal samples, these genes are not expressed in abnormal cells, indicating that abnormal PCs (i.e. the PCs plotted in Figure 4a for comparison with monocytic populations) are uncontaminated. Color intensity corresponds to the mean expression of the gene in each group, and dot size corresponds to the fraction of cells in the group that express the gene.